perf(vllm): optimize MiniMax M3 inference on MI300X #1782

Draft
Oseltamivir wants to merge 1 commit into feat/m3-mi300x-mxfp8 from codex/minimax-m3-mi300x-ep-mxfp8

Conversation

@Oseltamivir

@Oseltamivir Oseltamivir commented Jun 15, 2026

Collaborator

Summary

  • keep this PR stacked on the current head (6f5a3991) of #1753, "[Experimental][DNM till upstream PR merges][AMD] perf: load-time block FP8 MoE for MiniMax M3 on MI300X", which supplies the load-time 128x128 block-FP8 conversion
  • replace the old route-compaction-only patch with the cumulative MI300X runtime patch validated during profiling
  • optimize MiniMax M3 sparse attention, index scoring, FP8 MoE routing/reduction, router projection, and residual collectives
  • enable the pinned AITER Gemma all-reduce + RMSNorm path only for TP8; EP8 remains on the faster native collective
  • use a 32K scheduler token budget only for measured long-prompt points (ISL >= 8192 && CONC >= 16)

This PR contains no profiling configuration and does not modify perf-changelog.yaml.

Profile basis

The final all-rank 8k1k/c256 EP profile shows one kernel stream and no compute/communication overlap window:

| Profile | Decode step | GPU busy | Collective + norm |
| --- | --- | --- | --- |
| native EP collective | 106.324 ms | 99.2% | 63.723 ms |
| AITER fused EP collective | 109.257 ms | 99.1% | 66.088 ms |

Native EP is 2.7% faster per profiled decode step. Across the profiled 256-request batch it improves output throughput by 6.4%, mean TTFT by 6.8%, and mean TPOT by 2.5%.
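The per-step figure can be reproduced from the two decode-step times in the profile table. The sketch below is only a sanity check; the helper name `percent_faster` is made up for illustration and is not part of the patch.

```python
# Decode-step times in milliseconds, copied from the profile table.
native_ms = 106.324
aiter_ms = 109.257


def percent_faster(fast_ms: float, slow_ms: float) -> float:
    """Time saved by the faster variant, as a percentage of the slower one."""
    return (slow_ms - fast_ms) / slow_ms * 100.0


step_gain = percent_faster(native_ms, aiter_ms)
print(f"native EP is {step_gain:.1f}% faster per decode step")
```

The throughput, TTFT, and TPOT deltas are batch-level measurements and cannot be derived from the step times alone.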

The remaining native critical path is:

  • native all-reduce: 62.913 ms
  • block-FP8 MoE experts: 17.361 ms
  • sparse index score: 7.920 ms
  • sparse attention decode: 5.401 ms

The dependencies are serial at the block boundaries, so moving these kernels to another stream would not hide useful work. The implementation instead removes work from each stage and fuses only where the measured dependency permits it.
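To see why the all-reduce dominates, the listed kernel times can be expressed as a share of the native decode step. This is a rough sketch that assumes the four kernel times belong to the same 106.324 ms native EP step reported in the profile table.

```python
# Native EP decode-step time (ms) from the profile table.
step_ms = 106.324

# Kernel times (ms) from the critical-path list.
critical_path_ms = {
    "native all-reduce": 62.913,
    "block-FP8 MoE experts": 17.361,
    "sparse index score": 7.920,
    "sparse attention decode": 5.401,
}

listed_ms = sum(critical_path_ms.values())
for name, ms in critical_path_ms.items():
    print(f"{name:<26} {ms:7.3f} ms  {ms / step_ms:6.1%} of step")
print(f"{'listed total':<26} {listed_ms:7.3f} ms  {listed_ms / step_ms:6.1%} of step")
```

Under that assumption the all-reduce alone accounts for more than half of the step, which is consistent with the focus on collectives in the rest of the PR.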

Optimizations

  • compact EP routes from 128 global experts to the 16 experts owned by each rank
  • tighten route padding and use the route-aware fused MoE reduction
  • defer FFN reductions into the following Gemma RMSNorm boundary
  • use a gfx942 FP32 router projection kernel for the exact MiniMax M3 decode shape
  • tune the MI300X E16 block-FP8 expert configuration
  • replicate the TP input embedding to remove its startup all-reduce
  • use the AITER fused all-reduce + Gemma RMSNorm only on the measured TP8 shape
  • retain native collectives for EP, where the AITER fusion regresses both attention and FFN boundaries

All optimized paths are gated to the profiled MiniMax M3/gfx942 shapes. Other models, platforms, parallel modes, and unsupported shapes retain the existing path.
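The first optimization, compacting routes to locally owned experts, can be illustrated in plain Python. This is a simplified sketch and not the patch's Triton code: the function name `compact_routes` is hypothetical, and contiguous expert ownership (rank r owns experts 16r to 16r+15) is an assumption.

```python
from typing import List, Tuple

NUM_GLOBAL_EXPERTS = 128  # from the PR description
EXPERTS_PER_RANK = 16     # 128 experts split across EP8


def compact_routes(topk_ids: List[List[int]], rank: int) -> List[Tuple[int, int, int]]:
    """Keep only the routes whose expert is owned by `rank`.

    Returns (token_index, slot_index, local_expert_id) triples.
    Assumes rank r owns the contiguous global experts [16r, 16r + 16).
    """
    first = rank * EXPERTS_PER_RANK
    last = first + EXPERTS_PER_RANK
    kept = []
    for token, experts in enumerate(topk_ids):
        for slot, expert in enumerate(experts):
            if first <= expert < last:
                # Remap the global expert id into the rank-local range 0..15.
                kept.append((token, slot, expert - first))
    return kept


# Two tokens with four routed experts each; rank 1 owns experts 16..31.
routes = [[0, 17, 40, 127], [16, 31, 5, 90]]
print(compact_routes(routes, rank=1))
```

Only three of the eight routes survive on rank 1 in this example, so downstream kernels iterate over a much smaller route list.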

Performance

MI300X output throughput, aggregate across 8 GPUs:

| Configuration | Baseline | Final | Change |
| --- | --- | --- | --- |
| 8k1k EP8 c256, main run 27510667862 | 1,066.3 tok/s | 1,695.8 tok/s | +59.0% |
| 8k1k EP8 c256, regressed run 27569397626 | 883.9 tok/s | 1,695.8 tok/s | +91.9% |

The baseline rows come from Actions runs that use the regular InferenceX sweep request count. The Final value is a warmed production-image spot check with 256 fixed-length requests; the same-harness component run improved from 1,391.9 to 1,512.8 tok/s (+8.7%) before the final production-image validation.

Production-image spot checks with the exact committed patch:

| Configuration | Output throughput |
| --- | --- |
| 8k1k EP8 c256 | 1,695.8 tok/s |
| 8k1k TP8 c16 | 761.5 tok/s |
| 32k1k TP8 c16 | 402.3 tok/s |
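For comparison with single-GPU numbers, the spot checks can be normalized per GPU. The sketch assumes these values are aggregated across 8 GPUs like the baseline comparison (the 1,695.8 tok/s figure appears in both).

```python
NUM_GPUS = 8  # assumed to apply to the spot checks as well

# Output throughput in tok/s, copied from the spot-check table.
spot_checks_tok_s = {
    "8k1k EP8 c256": 1695.8,
    "8k1k TP8 c16": 761.5,
    "32k1k TP8 c16": 402.3,
}

per_gpu = {name: total / NUM_GPUS for name, total in spot_checks_tok_s.items()}
for name, value in per_gpu.items():
    print(f"{name:<14} {value:7.2f} tok/s per GPU")
```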

Component A/B results:

  • TP8 AITER fusion: +5.1% at 8k1k/c16 and +1.7% at 32k1k/c16
  • 32K scheduler budget: +4.0% at 8k1k TP c16, +4.5% at 8k1k EP c256, and +5.8% at 32k1k TP c16
  • the scheduler override is intentionally disabled for 1k prompts, where it regressed TP c16 by 3.1%
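The scheduler gate described in this PR (`ISL >= 8192 && CONC >= 16`) is simple enough to state as a predicate. The actual benchmark script is a shell script; the Python below is only an illustrative sketch, and the function and constant names are hypothetical.

```python
from typing import List

LONG_PROMPT_ISL = 8192   # minimum input sequence length for the override
MIN_CONCURRENCY = 16     # minimum concurrency for the override
TOKEN_BUDGET = 32768     # the 32K scheduler token budget


def scheduler_args(isl: int, conc: int) -> List[str]:
    """Return extra server arguments for the measured long-prompt points only."""
    if isl >= LONG_PROMPT_ISL and conc >= MIN_CONCURRENCY:
        return ["--max-num-batched-tokens", str(TOKEN_BUDGET)]
    # 1k prompts and low-concurrency points keep the default budget.
    return []


print(scheduler_args(8192, 16))   # long prompt, enough concurrency
print(scheduler_args(1024, 16))   # 1k prompt: override stays off
```

Keeping the gate as a pure function of ISL and CONC makes it easy to confirm that the 1k-prompt points, where the override regressed, never receive the flag.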

Validation

  • production squash vllm/vllm-openai-rocm:minimax-m3
  • runtime patches apply sequentially over image revision 4a560dd8db67c270f5e2afb614558271b76f2294
  • all 19 generated runtime files match the validated vLLM tree byte-for-byte
  • patched production image served TP8 and EP8 successfully with CUDA graphs
  • pinned AITER source built and initialized its custom communicator on all 8 TP ranks
  • git diff --check
  • bash -n and ShellCheck
  • python -m pytest utils/matrix_logic/ -v: 156 passed
  • MI300X MiniMax 1k1k/8k1k config generation: 36 entries

Note

Medium Risk
Large inference-runtime patch changes MoE routing, collectives, and model forward semantics on a gated path; wrong gating could affect numerics or parallelism, but scope is limited to profiled MiniMax M3 MI300X configurations.

Overview
Adds a second runtime patch (minimaxm3_mi300x_profiled.patch) on top of the existing MXFP8 block-FP8 patch, and refactors the MI300X benchmark script to apply both patches generically, optionally install a pinned AITER build for TP8-only fused all-reduce + Gemma RMSNorm, and pass --max-num-batched-tokens 32768 when ISL >= 8192 and CONC >= 16.

The patch targets profiled MiniMax M3 / gfx942 shapes: EP8 MoE route compaction and tuned block-FP8 expert configs; a gfx942 small-batch router GEMM; Triton tweaks to sparse attention and index scoring; deferred FFN all-reduces fused into the next Gemma norm boundary on TP8; and replicated input embeddings on MI300X TP8 to drop an extra collective. AITER Gemma fusion stays off for EP and non-TP8; native collectives remain there.

Gating is explicit (parallel mode, token counts, hidden size 6144, etc.) so other models and platforms keep prior behavior, with fallbacks to unfused all-reduce + norm where fast paths do not apply.
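A minimal sketch of what such explicit gating might look like for the TP8-only AITER fusion follows. The `RunShape` type and the function name are hypothetical, the real patch also checks token counts and other conditions not modeled here, and treating EP as a flag on top of a TP8 launch is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunShape:
    """Hypothetical description of the launch configuration."""
    gpu_arch: str
    hidden_size: int
    tp_size: int
    ep_enabled: bool


def use_aiter_fused_allreduce_norm(shape: RunShape) -> bool:
    """True only for the profiled TP8 shape; everything else falls back."""
    return (
        shape.gpu_arch == "gfx942"
        and shape.hidden_size == 6144
        and shape.tp_size == 8
        and not shape.ep_enabled  # EP keeps the native collective
    )


tp8 = RunShape("gfx942", 6144, tp_size=8, ep_enabled=False)
ep8 = RunShape("gfx942", 6144, tp_size=8, ep_enabled=True)
print(use_aiter_fused_allreduce_norm(tp8), use_aiter_fused_allreduce_norm(ep8))
```

Any shape that fails the predicate would take the unfused all-reduce + norm fallback described above.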

Reviewed by Cursor Bugbot for commit 87e92e2. Bugbot is set up for automated code reviews on this repo.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir marked this pull request as ready for review June 16, 2026 17:53
@claude

claude Bot commented Jun 16, 2026

Contributor

Claude finished @Oseltamivir's task in 7m 0s


Review of PR #1782

LGTM — no blocking issues found.

Summary: This is a well-structured incremental optimization for MiniMax M3 EP8 on MI300X. The patch adds two complementary paths (short-context native/BF16 dispatch via the oracle, long-context sparse BF16 emulation with fused SwiGLU), and the gating logic correctly ensures each path is only reached for the profiled configuration. Key observations:

  • Correctness of the fused gated kernel: fused_moe_gated_kernel correctly computes both gate and up projections in a single pass, preserves the BF16 store/reload boundary before applying SwiGLU in FP32 (matching the unfused numerical behavior), and writes per-route rows to the activation buffer. The b_up_ptrs = b_gate_ptrs + N * stride_bn offset correctly addresses the split layout.
  • Local-expert allocation tightening: The num_local_experts parameter in moe_align_block_size correctly reduces the allocation from 128 * (block_size-1) to 16 * (block_size-1) padding tokens. The _max_post_padded bound is sound — min(allocation, M*block_m, M + E*(block_m-1)) takes the tightest of three independent bounds.
  • EP route filtering in emulation path: The _apply_sparse_mi300x_ep method correctly passes ignore_invalid_experts=True and num_local_experts=E to _prepare_expert_assignment, so remote-expert routes are dropped from sorted_token_ids. The downstream moe_fused_mul_sum with apply_weights=False and expert_map correctly zeros contributions from non-local experts (via tl.where(expert_mask, b_val, 0.0)), so unwritten rows in the workspace don't leak garbage.
  • Fused top-k atomic path in native MXFP8: FUSE_TOPK=True uses tl.atomic_add on a zero-initialized output with c_row = offs_token // top_k, correctly accumulating weighted per-route results into per-token output rows. The M_routed % top_k divisibility check guards against misuse.
  • Oracle routing: Short-context EP8 (max_model_len <= 4096) routes to Mxfp8NativeTritonExperts with mixed native/BF16 weights; long-context EP8 routes to Mxfp8EmulationTritonExperts with the sparse BF16 path. Both guards (_should_use_native_ep, _is_minimax_m3_mi300x_ep8) are scoped to the exact MiniMax M3 config.
  • Benchmark script: Patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. Server launch uses proper line continuations.
  • Repo guidelines: No master config or perf-changelog changes needed (this PR only touches the benchmark script and runtime patch). No MTP or model prefix concerns.
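The padding arithmetic in the allocation bullet above can be checked with a few lines of Python. This is a sketch of the stated formulas only, not the patched `moe_align_block_size`; the helper names are hypothetical, and the block size of 16 and the 64-route batch are example values chosen for illustration.

```python
def padding_tokens(num_experts: int, block_size: int) -> int:
    """Worst-case padding: each expert can waste up to block_size - 1 slots."""
    return num_experts * (block_size - 1)


def max_post_padded(allocation: int, m_routed: int, num_experts: int, block_m: int) -> int:
    """Tightest of the three independent bounds named in the review."""
    return min(
        allocation,                               # size of the allocated buffer
        m_routed * block_m,                       # every route in its own block
        m_routed + num_experts * (block_m - 1),   # routes plus per-expert padding
    )


BLOCK = 16  # example value
print(padding_tokens(128, BLOCK), "->", padding_tokens(16, BLOCK))
print(max_post_padded(allocation=4096, m_routed=64, num_experts=16, block_m=BLOCK))
```

With these example values the worst-case padding drops from 1920 to 240 slots, and the third bound is the binding one for a 64-route batch.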

@Oseltamivir Oseltamivir marked this pull request as draft June 16, 2026 20:26
@Oseltamivir Oseltamivir changed the title from "perf(vllm): fuse MiniMax M3 BF16 EP experts on MI300X" to "perf(vllm): compact MiniMax M3 EP decode routes on MI300X" Jun 16, 2026
@Oseltamivir Oseltamivir marked this pull request as ready for review June 17, 2026 01:09
@claude

claude Bot commented Jun 17, 2026

Contributor

Claude finished @Oseltamivir's task in 4m 57s


Review of PR #1782

  • Read repository guidelines (CLAUDE.md)
  • Gather context and read changed files
  • Review benchmark shell script changes
  • Review runtime patch file
  • Validate against PR review checklist
  • Post review summary

LGTM — no blocking issues found.

Summary: This is a well-scoped incremental EP8 decode optimization stacked on #1753. The patch adds two complementary paths: short-context EP8 routes to Mxfp8NativeTritonExperts with the profiled native/BF16 dispatch; long-context EP8 routes to Mxfp8EmulationTritonExperts which now compacts decode batches (≤256 tokens) to locally-owned experts with BM16 tiles, bypassing the overpadded generic path. Key observations:

  • Local-route compaction in _apply_sparse_mi300x_ep: The _prepare_expert_assignment call with ignore_invalid_experts=True and num_local_experts=E correctly drops remote-expert routes from sorted_token_ids. Buffer management is sound — workspace2 is reused for intermediate_gate_up (GEMM1 output) and intermediate_output (GEMM2 output) with the activation fully consuming the former before the latter is written. GEMM2 passes top_k=1 (treating each route as an independent token) with mul_routed_weight=True, then moe_fused_mul_sum with apply_weights=False sums per-token without double-applying weights, zeroing non-local experts via expert_map.
  • _max_post_padded bound: The three-way min(allocation, M_routed * block_m, M_routed + E * (block_m - 1)) is sound — the first is the buffer size, the second the original bound, the third the tighter per-expert padding bound. The block-alignment floor is correct.
  • moe_align_block_size tightening: When ignore_invalid_experts, expert_map, and num_local_experts are all set, padding allocation drops from global_experts * (block_size - 1) to local_experts * (block_size - 1). The 0 < num_local_experts <= num_experts validation prevents misuse.
  • Fused top-k atomic in _mxfp8_grouped_gemm_*_kernel: c_row = offs_token // top_k correctly maps route-indexed rows to token-indexed output, tl.atomic_add with zero-initialized output accumulates concurrent routes, and the M_routed % top_k divisibility check is validated before launch.
  • Route-aware SwiGLU kernel (_swiglu_oai_quant_routed_kernel): Processes only locally-routed rows via sorted_token_ids, with proper padding/remote masking. Gate is clamped from above only (gate * sigmoid → 0 for negative gate, so lower clamp is a no-op), up is symmetrically clamped — matching the SwiGLU-OAI numeric contract.
  • Oracle routing: Short-context EP8 (≤4096 max_model_len) → Mxfp8NativeTritonExperts; long-context EP8 → Mxfp8EmulationTritonExperts. Both guards are scoped to the exact profiled MiniMax M3 gfx94x shape. The bf16_weights_available flag prevents using uninitialized BF16 weights in long-context EP8 where they aren't retained.
  • Decode gating: The use_sparse_ep predicate in Mxfp8EmulationTritonExperts.apply correctly gates on model match, BF16 dtype, ≤256 tokens, SwiGLU activation, expert_map presence, no router-weight-on-input, and no LoRA. Prefill and mixed batches fall through to the generic TritonExperts path.
  • Benchmark script: EP patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. No master config or perf-changelog changes are included (as documented in scope).
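The fused top-k accumulation described in the review (`c_row = offs_token // top_k` with an atomic add into a zero-initialized output) has a simple sequential reference. The sketch below uses one scalar per route instead of a hidden-size row, and the function name is hypothetical; it illustrates the indexing rule rather than the Triton kernel.

```python
from typing import List


def accumulate_routes(route_out: List[float], route_weight: List[float], top_k: int) -> List[float]:
    """Sum weighted per-route results into per-token outputs."""
    m_routed = len(route_out)
    if m_routed % top_k != 0:
        # Mirrors the divisibility check performed before kernel launch.
        raise ValueError("number of routes must be a multiple of top_k")
    out = [0.0] * (m_routed // top_k)  # zero-initialized output
    for route_row, (value, weight) in enumerate(zip(route_out, route_weight)):
        token_row = route_row // top_k      # route-indexed row -> token row
        out[token_row] += weight * value    # stands in for tl.atomic_add
    return out


# Two tokens, top_k = 2: routes 0-1 belong to token 0, routes 2-3 to token 1.
print(accumulate_routes([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.25, 0.75], top_k=2))
```

The zero initialization matters because every route adds into its token row; a stale output buffer would be accumulated along with the results.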

@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from d1638a0 to 465ff47 June 17, 2026 20:51
@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch 6 times, most recently from 95e79da to 27510c4 June 17, 2026 21:47
@Oseltamivir Oseltamivir changed the title from "perf(vllm): compact MiniMax M3 EP decode routes on MI300X" to "perf(vllm): compact MiniMax M3 block-FP8 EP routes on MI300X" Jun 18, 2026
@Oseltamivir Oseltamivir changed the title from "perf(vllm): compact MiniMax M3 block-FP8 EP routes on MI300X" to "perf(vllm): optimize MiniMax M3 inference on MI300X" Jun 19, 2026
@Oseltamivir Oseltamivir force-pushed the codex/minimax-m3-mi300x-ep-mxfp8 branch from 2b449ab to 87e92e2 June 19, 2026 07:30
@Oseltamivir

Collaborator Author

Optimized MI300X-only sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27812712075

Matrix: c1, c16, and c256 for each of 1k1k and 8k1k (TP8 at c1/c16, EP8 at c256), using optimized commit 87e92e28. No eval or non-MI300X jobs.
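The matrix described above expands to six jobs. The enumeration below is an illustrative sketch only; the real sweep is generated by the repository's config tooling, and the names used here are hypothetical.

```python
from itertools import product

SEQ_CONFIGS = ["1k1k", "8k1k"]
CONCURRENCIES = [1, 16, 256]


def parallel_mode(conc: int) -> str:
    """TP8 at c1/c16 and EP8 at c256, as stated in the comment."""
    return "EP8" if conc == 256 else "TP8"


matrix = [
    (seq, conc, parallel_mode(conc))
    for seq, conc in product(SEQ_CONFIGS, CONCURRENCIES)
]
for seq, conc, mode in matrix:
    print(f"{seq} c{conc} {mode}")
```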

@Oseltamivir Oseltamivir marked this pull request as draft June 19, 2026 21:27